PDF-Text-Extractor

PDF-Text-Extractor

This GUI Application allows you to extract the texgt from the PDF files. The project is build using the PyPDF2 library for extracting text from PDFs, and the tkinter library for creating the GUI.

Getting Started

To run the project, you will need to have Python and pip installed on your system.

Installation

  1. Clone or download the repository to your local machine.

    git clone https://github.com/SamAddy/PDF-Extract-Text.git
  2. Enter the working directory.

    cd PDF-Extract-Text
  3. Use pip to install the required libraries.

    pip install -r requirements.txt

Usage

  1. Run the app using the following command:

    python app.py
  2. A GUI window will appear, with a button to selecgt the PDF file you want to extract text from.

  3. Once you have selected the file, the text will be extracted and displayed in the text box.

  4. You can also save the text to a file by clicking ‘Save’ button.

image1 image2

Note

Please keep in mind that not all pdfs are created equal, and some pdfs may have text in an image format or other format that may not be extractable with PyPDF2.

Built With

  • Python - The programming language used.
  • PYPDF2 - A library for extracting text from PDF files.
  • Tkinter - A library for creating GUI in Python.

Contributing

Contributions are absolutely welcome. If you have an idea for an improvement, please open an issue or submit a pull request.

Acknowledgement

Source Code: app.py

import tkinter as tk
import PyPDF2
from PIL import Image, ImageTk
from tkinter.filedialog import askopenfile

root = tk.Tk()
root.title('PDF to TEXT')
root.iconbitmap('./logo.png')
root.resizable(False, False)


canvas = tk.Canvas(root, width=600, height=400)
canvas.grid(columnspan=3, rowspan=3)

# Insert logo into the window
logo = Image.open('logo2.png')
logo = ImageTk.PhotoImage(logo)
logo_label = tk.Label(image=logo)
logo_label.image = logo
logo_label.grid(column=1, row=0)

# instructions
instructions = tk.Label(root, text='Select a PDF file on your device to extract all its text.', font='calibre')
instructions.grid(columnspan=3, column=0, row=1)

# Get the PDF file on device
browse_text = tk.StringVar()
browse_btn = tk.Button(root, textvariable=browse_text, command=lambda: open_file(), font='calibre', bg='red', width=15, height=2)
browse_text.set('Browse')
browse_btn.grid(column=1, row=2)

canvas = tk.Canvas(root, width=600, height=200)
canvas.grid(columnspan=3, rowspan=3)


def open_file():
    browse_text.set('On it...')
    # Open the PDF file using the PdfFileReader object
    file = askopenfile(parent=root, mode='rb', title='Choose a file', filetypes=[('PDF file', '*.pdf')])
    text = ""

    if file:
        read_pdf = PyPDF2.PdfReader(file)
        for i in range(len(read_pdf.pages)):
            text += read_pdf.pages[i].extract_text()

        text_box = tk.Text(root, height=10, width=50, padx=15, pady=15)
        text_box.insert(1.0, text)
        text_box.tag_config('center', justify='center')
        text_box.tag_add('center', 1.0, 'end')
        text_box.grid(column=1, row=3)

        browse_text.set('Browse')


def convert_to_docx():
    pass


root.mainloop()